Viewing the first six rows of the dataset to see what the data looks like.
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
The table contains 11 chemical properties that can influence the quality of red wine, for instance acidity, sugar, pH, and alcohol. The last column specifies the quality of the wine (a score between 0 and 10).
Now let’s look at the data types in the dataset. According to the output below, the row number (X) and quality are integers, while the rest of the variables are numeric.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
I remove the X column from the dataset, as it only holds row numbers, which are of no use in my investigation.
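One common way to drop a column in R is to assign NULL to it. The sketch below uses a small stand-in data frame built from a few columns of the first six rows shown above; the object name `wine` is an assumption, since the original code is not shown.

```r
# Stand-in data: a few columns from the first six rows of the head() output.
# The name `wine` is hypothetical.
wine <- data.frame(
  X             = 1:6,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L)
)

# Assigning NULL to a column removes it from the data frame.
wine$X <- NULL

names(wine)
```

The same assignment should work on the full data frame, whatever it is called in the analysis.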
I create some basic histograms for each chemical property in the table.
First, I would like to see how quality is distributed. Quality is one of my main points of interest in this dataset, as I’d like to see which properties of red wine can influence it.
Distribution of Red Wines Based on Quality
Based on the plot, the majority of the red wines (more than 1,200) have a quality score of either 5 or 6. About 200 wines score 7, noticeably fewer than those scoring 5 or 6. The rest score 3, 4, or 8.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
The average quality of red wines in the dataset is 5.64, with a maximum of 8 and a minimum of 3. Therefore, no red wine in the dataset has a quality below 3 or above 8.
Since quality is scored from 0 to 10, I would like to create a rating system covering scores 1 to 10 (no wine in the data has a score of zero) and categorize the scores like this:
1-2: Poor
3-4: Below Average
5-6: Average
7-8: Above Average
9-10: Excellent
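The mapping above can be sketched with base R's `cut()`, which bins numeric scores into labelled intervals. This is only one possible way to build the column, not necessarily the one used in the analysis; the input is the first ten quality scores from the `str()` output.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# cut() uses right-closed intervals by default: (0,2], (2,4], (4,6], (6,8], (8,10].
# A score of 0 would fall outside every interval and become NA.
rating <- cut(
  quality,
  breaks = c(0, 2, 4, 6, 8, 10),
  labels = c("Poor", "Below Average", "Average", "Above Average", "Excellent")
)

as.character(rating)
```

Keeping `rating` as a factor (rather than converting to character) preserves the Poor-to-Excellent ordering, which can be convenient when plotting by category.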
The first 20 values of the rating column now look like this:
## [1] "Average" "Average" "Average" "Average"
## [5] "Average" "Average" "Average" "Above Average"
## [9] "Above Average" "Average" "Average" "Average"
## [13] "Average" "Average" "Average" "Average"
## [17] "Above Average" "Average" "Below Average" "Average"
After looking at the quality distribution, I’d like to get histograms of quality and all the chemical properties. These histograms also help me find out whether any variable should be transformed (e.g. with log10) for a smoother distribution.
Distribution of Quality and All Chemical Properties
Most Fixed Acidity values fall between 5 and 12, with a few above 12 and some outliers around 16. I apply log10 to the x-axis to make the distribution look closer to normal.
Fixed Acidity Level - Comparing Log10 to the Original Distribution
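To see why a log10 scale pulls in large values, here is a small sketch on the first ten fixed acidity values from the `str()` output. The variable names are illustrative, and ten values are far too few to reproduce the histograms; the point is only how the transform compresses the upper end.

```r
# First ten fixed acidity values from the str() output (the full column has 1599).
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

# log10 shrinks large values proportionally more than small ones.
log_fixed_acidity <- log10(fixed_acidity)

# Distance of the largest value from the median, in units of the overall range.
relative_gap <- function(x) (max(x) - median(x)) / diff(range(x))

c(raw = relative_gap(fixed_acidity), log10 = relative_gap(log_fixed_acidity))
```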
Most Volatile Acidity values fall between 0 and 0.8, with a few reaching 1 and outliers between 1 and 1.6.
Taking the square root of the values moves the distribution closer to normal. However, the outliers above 1.00 remain.
Volatile Acidity Level - Comparing Square Root to the Original Distribution
Citric Acid values are spread along the x-axis from zero (the most frequent value) to about 0.75. There seem to be outliers at a citric acid level of 1.00.
The counts rise and fall along the x-axis. The highest count is at a citric acid level of zero, and there are two more visible peaks around 0.25 and 0.5.
Distribution of Citric Acid Level
Most Residual Sugar values fall around 1.5-2.5. There are many outliers from 7 onwards.
Distribution of Residual Sugar
The plot looks like a normal distribution with a long tail to the right (i.e. it is right-skewed).
The outliers can directly shift the average value for residual sugar. Let’s look at the summary for residual sugar to see how the average is influenced.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
According to the plot, most residual sugar values are concentrated around 1.5-2.5, so I expected the mean to fall in this range as well. I was curious how much the outliers would influence it. According to the summary, the mean is 2.54, just above that range and above the median of 2.20, so the outliers pull it up only slightly. This makes sense, as far more wines fall in the 1.5-2.5 range than among the outliers.
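The effect of a single large value on the mean, compared with the median, can be sketched on the first ten residual sugar values from the `str()` output. The numbers here describe this ten-value sample only, not the full column.

```r
# First ten residual sugar values from the str() output; 6.1 is the one large value.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

mean_all    <- mean(residual_sugar)
median_all  <- median(residual_sugar)

# Same statistics with the largest value left out.
trimmed        <- residual_sugar[residual_sugar < 6]
mean_trimmed   <- mean(trimmed)
median_trimmed <- median(trimmed)

# The mean moves noticeably; the median does not move at all in this sample.
c(mean_all = mean_all, mean_trimmed = mean_trimmed,
  median_all = median_all, median_trimmed = median_trimmed)
```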
Most Chlorides values are concentrated around 0.03-0.15. There are a few outliers, especially above 0.35. I would like to check the summary to see how the mean is influenced by the outliers.
Distribution of Chlorides
The plot shows that the distribution is almost normal, with a tail to the right.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
I expected the mean to fall somewhere between 0.05 and 0.10, and the summary confirms it: the mean of 0.087 is in my expected range.
The Total Sulfur Dioxide distribution is skewed to the right (it has a long right tail). The counts do not rise and fall much; however, there are a few noticeable peaks around 30-40, 60-80, and 80-90.
I can use log10 on the x-axis to bring the distribution closer to normal. Let’s see what this looks like.
Distribution of Total Sulfur Dioxide - Comparing Original to Log10
Based on the plots for Total Sulfur Dioxide, using log10 on the x-axis made the distribution look roughly normal. There are still a few peaks in the counts, but the overall shape looks normal.
The Free Sulfur Dioxide distribution is also skewed to the right, with most values concentrated around 5-15.
Applying log10 on the x-axis does not work as well for Free Sulfur Dioxide, though. It changes the distribution, but the bulk of the values shifts toward the right side of the plot rather than becoming normally distributed.
Distribution of Free Sulfur Dioxide - Comparing Original to Log10
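Judging skew by eye can be backed up with a number. The sketch below defines a simple moment-based skewness (the `skewness` helper is written here for illustration and is not from a package) and applies it to the first ten free sulfur dioxide values from the `str()` output. Positive values indicate a longer right tail, negative values a longer left tail. On the full column the result may well differ from this small sample.

```r
# Moment-based sample skewness: third central moment over variance^(3/2).
# Hypothetical helper, defined here for illustration.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# First ten free sulfur dioxide values from the str() output.
free_so2 <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)

skews <- c(
  raw   = skewness(free_so2),
  sqrt  = skewness(sqrt(free_so2)),
  log10 = skewness(log10(free_so2))
)

round(skews, 2)
```

Comparing the raw, square-root, and log10 versions side by side is one way to pick the mildest transform that brings the skewness close to zero.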
The Density distribution looks like a normal distribution.
Density Distribution
Getting the summary of the distribution, I have:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
The mean of the distribution is 0.9967, which is almost in the middle of the distribution. The median (0.9968) differs from the mean by only 0.0001, so it is also close to the middle.
The pH distribution looks normal, with a few outliers.
pH Level Distribution
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The summary of the distribution shows the mean value falls at 3.31, very close to the median value.
Most Sulphates values are concentrated around 0.5-0.7, with a few outliers between 1.6 and 2.0. I also check the summary to see where the mean and median fall.
Distribution of Sulphate
The distribution looks roughly normal, with a tail to the right.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The mean of 0.66 is within the range where most of the values are. The median of 0.62 is also close to the mean.
The Alcohol distribution rises and falls, with most values around 9.5% alcohol.
Alcohol Distribution
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The summary for alcohol shows a mean alcohol percentage of 10.42 and a median of 10.20.
Now, I’ll answer a few questions regarding the dataset and my initial analysis.
The dataset has 1599 observations and 13 variables. Since I removed column X and added column rating, the number of variables stays the same.
Of the 13 variables, 11 describe chemical properties of red wines, one is the quality score, and one is the rating category derived from that score.
For me, the two main features of interest are quality and rating. Further in the analysis, I would like to see how quality changes with each chemical property.
Also, I would like to see whether the alcohol level has any correlation with the quality of red wines.
For quality and rating, all the chemical properties can play an important role in supporting my investigation.
Yes, I created a new column, rating, which holds a rating system from Poor to Excellent based on the quality of red wines.
I removed the X column, since it only held the row numbers and played no role in my investigation.
I added the rating column to later categorize the red wines by quality.
With the plots in this section, I would like to dig deeper into how the chemical properties affect each other and the quality of the wine.
For an overview, I use the corrplot function to plot the correlations of the chemical properties with each other.
Correlation Plot for Each Chemical Property and Quality of Red Wine
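A correlation plot of this kind is drawn from a correlation matrix, which base R's `cor()` computes for all pairs of columns at once. The sketch below uses only the three acidity columns and the first ten rows from the `str()` output, so its values will not match the full-data correlations reported in this section; the data frame name `acids` is illustrative.

```r
# First ten rows of the three acidity columns, copied from the str() output.
acids <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
)

# Pearson correlation for every pair of columns; the diagonal is always 1.
cor_matrix <- cor(acids)

round(cor_matrix, 2)

# With the corrplot package installed, the matrix could be drawn with:
# corrplot::corrplot(cor_matrix)
```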
Putting aside the correlation of each property with itself (i.e. 1), let’s look at the rest of the patterns.
The circle for Total Sulfur Dioxide and Free Sulfur Dioxide is a darker shade of blue, which shows a positive correlation.
## [1] 0.6676665
Citric Acid and Fixed Acidity also have a positive correlation. However, both are negatively correlated with Volatile Acidity.
The computed correlations show the same thing.
Correlation between Citric Acid and Fixed Acidity:
## [1] 0.6717034
Correlation between Citric Acid and Volatile Acidity:
## [1] -0.5524957
Correlation between Fixed Acidity and Volatile Acidity:
## [1] -0.2561309
Other correlations of interest are:
pH and Fixed Acidity, as well as pH and Citric Acid, are negatively correlated.
Density and Alcohol are also negatively correlated.
Quality seems to be positively correlated with Alcohol.
I use three plots to display the relationships among the acidity variables.
Relation between Fixed Acidity and Citric Acid
Fixed Acidity and Citric Acid seem to be positively correlated: as the fixed acidity of red wine increases, the citric acid level also increases.
Relation between Fixed Acidity and Volatile Acidity
Fixed acidity and volatile acidity are negatively correlated: wines with higher volatile acidity tend to have lower fixed acidity.
Relation between Citric Acid and Volatile Acidity
Citric acid and volatile acidity have a negative correlation with each other as well.
Comparing Acidity with pH Levels
pH has a positive correlation with volatile acidity and negative correlations with citric acid and fixed acidity.
The lower the pH, the more acidic the wine, so the negative correlations with citric acid and fixed acidity make sense: wines with more of these acids have a lower pH. A high amount of volatile acidity causes an unpleasant, vinegar-like taste in red wines. Unlike the other two, volatile acidity rises with pH, so wines with higher volatile acidity tend to be less acidic overall.
In other words, a higher pH (i.e. lower acidity) goes with lower citric acid and fixed acidity levels.
Residual sugar and density seem to be positively correlated. As the density increases, the residual sugar increases as well.
Relation between Residual Sugar and Density
The computed correlation is 0.355, which indicates only a weak-to-moderate relationship.
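The size of a correlation and its statistical significance are separate questions. Base R's `cor.test()` reports both the estimate and a p-value. The sketch below runs it on the six rows visible in the `head()` output, so the result describes those six rows only and will not reproduce the full-data value; the variable names are illustrative.

```r
# Residual sugar and density for the first six rows, from the head() output.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8)
wine_density   <- c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978)

# Pearson correlation test: estimate, confidence interval, and p-value.
result <- cor.test(residual_sugar, wine_density)

result$estimate  # the correlation coefficient for this small sample
result$p.value   # probability of a correlation this extreme if the true value were 0
```

With many observations, even a modest coefficient can have a very small p-value, so it helps to report strength (the coefficient) and significance (the p-value) separately.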
Density is negatively correlated with the amount of alcohol in red wine: as the density increases, the alcohol content decreases.
Relation between Alcohol Level and Density
Quality of Red Wine Based on Its pH Level
I expected to see a wider pH range for qualities 5 and 6 than for the other categories, but according to the quality vs. pH plot, the pH range is more or less the same for each quality level.
The highest median is for quality 3, and the lowest is for quality 8. I expected a higher pH for quality 8, which would mean it is less acidic, but that does not seem to be the case.
I calculate the correlation between quality and pH to double-check my observation from the plot. Quality and pH are hardly correlated.
## [1] -0.05773139
Quality of Red Wine Based on Its Alcohol Level
The highest median alcohol level belongs to quality 8. This supports the positive correlation between wine quality and alcohol level.
Quality 8 also seems to have the widest range of all the categories.
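The medians read off a box plot can also be computed directly with `tapply()`, which applies a function to each group. The sketch below uses the first ten alcohol and quality values from the `str()` output, so it only covers qualities 5, 6, and 7 and says nothing about quality 8.

```r
# First ten quality scores and alcohol percentages from the str() output.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Median alcohol within each quality level; names are the quality scores.
median_alcohol <- tapply(alcohol, quality, median)

median_alcohol
```

Swapping `median` for `range` or `IQR` gives a numeric check on which quality level has the widest spread.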
Quality of Red Wine Based on Its Volatile Acidity Level
The lowest median volatile acidity belongs to the highest quality of red wine. This is consistent with the negative relationship between volatile acidity and quality.
Wines with quality 3 or 4 have the widest range of volatile acidity levels.
I sum up the bivariate section by answering a few more questions about the dataset.
Quality of red wine correlates positively with the amount of alcohol and negatively with the volatile acidity level.
pH correlates with the citric, fixed, and volatile acidity levels. For volatile acidity (the unpleasant kind), pH tends to be higher, while for citric and fixed acidity (the pleasant kinds), pH tends to be lower.
I expected pH to correlate with the quality of wine, but that was not the case: they were hardly correlated.
I observed that density is negatively correlated with alcohol: the higher the density, the lower the alcohol level. Density is also positively correlated with residual sugar.
Of the variables analyzed, the strongest positive relationship was between Citric Acid and Fixed Acidity, with a correlation coefficient of 0.67.
pH and Fixed Acidity had a strong negative correlation of -0.68, the largest in magnitude.
Based on some of the relationships found in the bivariate plots, I take my analysis further by examining those relationships with regard to wine quality and rating. I am especially keen to dig into Volatile Acidity, Citric Acid, Fixed Acidity, pH, Alcohol, and Density, and find out how their combination can influence quality.
I first look into the different acidity and alcohol levels.
Volatile acidity in high amounts seems to lower the quality of wine. A high amount of alcohol, on the other hand, goes with better wine quality.
Comparison of Volatile Acidity and Alcohol Level with Regards to Quality
It looks like a higher amount of volatile acidity can offset the effect of a high amount of alcohol. As volatile acidity goes up, the quality of wine drops; as volatile acidity goes down and the alcohol level increases, the quality increases.
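A numeric companion to this kind of plot is a table of group means per rating. The sketch below uses `aggregate()` on the first ten rows from the `str()` output. This sample only contains Average and Above Average wines, so the two-way `ifelse()` is a simplification made for this sketch, and the data frame name `sample_wines` is illustrative.

```r
# First ten rows of quality, volatile acidity, and alcohol from the str() output.
sample_wines <- data.frame(
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Simplified rating for this sample only: it contains no scores below 5 or above 8.
sample_wines$rating <- ifelse(sample_wines$quality >= 7, "Above Average", "Average")

# Mean volatile acidity and mean alcohol within each rating.
group_means <- aggregate(
  cbind(volatile.acidity, alcohol) ~ rating,
  data = sample_wines,
  FUN  = mean
)

group_means
```

In this small sample, the Above Average group has slightly lower mean volatile acidity and slightly higher mean alcohol, in line with the pattern described above.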
Since citric acid adds freshness to wine, it may be positively correlated with quality. Now, I would like to see how it compares with volatile acidity.
Comparison of Citric Acid and Volatile Acidity with Regards to Quality
It looks like a high amount of volatile acidity can have more influence on the quality of wine than citric acid. For instance, in the ‘Below Average’ section there are points with a high level of citric acid (between its median of 0.26 and 0.50) that, due to a high level of volatile acidity, still fall in the below-average category.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
## [1] "Correlation between Citric Acid and Quality"
## [1] 0.2263725
## [1] "Correlation between Fixed Acidity and Quality"
## [1] 0.1240516
## [1] "Correlation between Citric Acid and Fixed Acidity"
## [1] 0.6717034
Fixed acidity and citric acid have a strong correlation of 0.672, and both are positively correlated with quality. Citric acid has a stronger positive correlation with quality (0.226) than fixed acidity does (0.124).
Comparison of Citric and Fixed Acidity with Regards to Quality
The positive correlation of citric acid and fixed acidity with each other and with quality is also visible in this plot. However, high citric acid and fixed acidity together do not guarantee a high-quality wine. As seen in the ‘Below Average’ section, there are points where citric acid is around 0.50 and fixed acidity is around 13 (both around the 90th percentile), but the quality of the wine is 3 or 4, which is quite poor.
Another interesting point about this plot is how many points lie at a citric acid level of zero, especially in the Average and Below Average sections. This makes me wonder whether it is a mistake, or whether some higher-quality wines really contain no citric acid. Does this mean no “freshness”?
## [1] "Fixed Acidity:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## [1] "Citric Acid"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
pH is another interesting property for me, since it indicates how acidic or basic the wine is.
## [1] -0.6829782
Fixed acidity and pH have a strong negative correlation with each other. Let’s look at their plot in relation to wine quality.
Comparison of pH Level and Fixed Acidity with Regards to Quality
It looks like very high pH levels do not necessarily influence the quality of wine. This is also reflected in their correlation (-0.058), which is negligible.
## [1] "Correlation between PH and Quality"
## [1] -0.05773139
Also, high levels of fixed acidity alone cannot define the quality of wine.
A high amount of total sulfur dioxide becomes noticeable in the nose and taste of wine. In my view, this means it does not improve the quality of the wine; however, I would like to check this. For comparison, I add alcohol as another variable to find out whether a higher alcohol level is correlated with a higher amount of total sulfur dioxide.
Comparison of Alcohol Level and Total Sulfur Dioxide with Regards to Quality
It looks like the lower the total sulfur dioxide and the higher the alcohol level, the higher the quality of the wine. The correlation of -0.205 between total sulfur dioxide and alcohol also supports the plot.
Comparison of pH Level and Density with Regards to Quality
An interesting point in this plot is that lower pH levels (i.e. high acidity) can still go with high density and good wine quality. At the same time, they can go with bad wine quality as well. This suggests that other chemical factors are more significant in defining the quality of wine.
Comparison of Density and Sulphates with Regards to Quality
## [1] "Correlation between Density and Sulphates"
## [1] 0.1485064
## [1] "Correlation between Density and Quality"
## [1] -0.1749192
## [1] "Correlation between Sulphates and Quality"
## [1] 0.2513971
Although higher density does not correlate with higher quality, the sulphates level has a stronger relationship with quality. There are many points where the density is high but, with a higher sulphates level, the overall quality seems to be higher.
Earlier in the bivariate analysis, I observed that alcohol is negatively correlated with density. I would now like to observe how they behave when I compare both to pH.
Comparison of Density and Alcohol with Regards to pH Level
It seems that a higher pH goes with a higher alcohol level and lower density, which altogether makes me think that the amount of alcohol plays a very important role in defining the overall quality of wine.
## [1] 0.2056325
Overall, it looked like a few chemical properties, such as alcohol, volatile acidity, pH, and citric acid, had more influence on the quality of the wine.
Alcohol was one of the chemical properties that caught me by surprise. Before investigating the data, I guessed that higher levels of alcohol would affect the taste and quality of wine negatively. However, in all my analyses of alcohol, the higher the level, the higher the quality of the wine and the lower the amount of undesirable chemicals like volatile acidity and total sulfur dioxide.
Fixed acidity and pH showed a strong negative correlation of -0.683: the more basic the wine, the lower its fixed acidity. However, low fixed acidity tends to go with lower wine quality.
Analyzing this data shows me that the best wines have higher levels of alcohol and citric acid, with lower levels of volatile acidity and density. The worst wines have higher levels of volatile acidity and lower levels of sulphates and alcohol.
One of the interesting interactions was between volatile acidity and citric acid. Volatile acidity has a more significant influence on the quality of wine than citric acid. Wines with higher volatile acidity also had less citric acid.
Alcohol and pH seem to be positively correlated: a higher amount of alcohol goes with a higher pH. Before starting this analysis, I guessed that more alcohol would bring more acidity to the taste of wine, but my analysis showed a different pattern. High levels of alcohol do not mean that the wine tastes more acidic.
The relationship between Alcohol and Total Sulfur Dioxide was of interest to me as well. Since total sulfur dioxide becomes noticeable in the nose and taste of wine, I initially guessed that it would be positively correlated with the amount of alcohol, which turned out not to be true.
No, I could not find any strong correlation that could help in identifying the quality of red wines. I felt more properties could have been added to the dataset to help draw a stronger conclusion about the factors that influence the quality of wine. Read more about this in the Reflection section.
In this section, I have chosen three plots from my work and included a more in-depth analysis of them.
Comparing Alcohol Level to Density of Red Wines
One of the interesting findings for me was that the percentage of alcohol in red wines is negatively correlated with density. The plot shows that as the alcohol percentage increases, the density decreases.
Alcohol percentage is a significant factor in the quality of wine and is positively correlated with it. This means high-quality wines tend to have lower density. Before analyzing this dataset, I guessed that higher density meant better quality, but this analysis showed that my guess was wrong.
There are, of course, points on the plot where both density and alcohol are high. For instance, points with a density over 1.000 and an alcohol percentage over 12% can still correspond to high-quality wine. Here, other chemical properties, or factors missing from the dataset such as the type of grape or the year of production, could have helped in drawing a stronger conclusion.
Influence of Volatile Acidity on Quality of Red Wine
Volatile acidity affects the taste of wine and thus its quality, and it has played a major role in my analysis. As this plot shows, the higher the quality of the wine, the lower its volatile acidity.
Another interesting point in this plot is the range of each quality level. Below-average wines (quality 3 and 4) show a wider spread of volatile acidity than the average and above-average wines. A wider spread does not mean there are more of them: most wines in the dataset are rated 5 or 6. The dataset also provides no information on price or demand, so it cannot tell us whether lower-quality (i.e. cheaper) wines are more popular.
There are outliers among wines of quality 7 or 8 with high volatile acidity. This could be due to a mistake, or other chemical factors or factors not included in this dataset could have affected the quality. For instance, aging might cause a specific wine to develop more volatile acidity.
Comparing the Effect of Volatile Acidity and Alcohol on Quality
Volatile Acidity and Alcohol percentage were two of my favorite chemical properties to explore in this dataset, since both have a strong influence on the quality of red wine. Putting these two variables in a plot together shows how they can affect quality.
Higher levels of volatile acidity go with lower wine quality, while higher levels of alcohol go with better quality. However, there are points in this plot where both volatile acidity and alcohol are high and the quality of the wine still falls in the ‘Below Average’ section. This could mean that the influence of volatile acidity on quality is stronger than that of alcohol percentage. To draw a stronger conclusion, I would need more information on how volatile acidity interferes with the effect of other chemical properties, which is outside the scope of this project.
Overall, observing and analysing this dataset showed that wines with higher levels of alcohol and citric acid have higher quality, while those with high levels of total sulfur dioxide or volatile acidity have lower quality. However, I felt that the dataset did not include enough variables to draw strong conclusions about what influences the quality of red wines.
For instance, factors such as the type of grape, weather, country of origin, or the year the wine was produced could have helped a lot in explaining the influence of the chemical properties in this dataset. I also felt that some of the properties could have developed over time (e.g. aging could have raised the level of some acidic properties in a certain type of wine).